Multi‐Model Prediction of West Nile Virus Neuroinvasive Disease With Machine Learning for Identification of Important Regional Climatic Drivers

Abstract West Nile virus (WNV) is the leading cause of mosquito‐borne illness in the continental United States (CONUS). Spatial heterogeneity in historical incidence, environmental factors, and complex ecology make prediction of spatiotemporal variation in WNV transmission challenging. Machine learning provides promising tools for identification of important variables in such situations. To predict annual WNV neuroinvasive disease (WNND) cases in CONUS (2015–2021), we fitted 10 probabilistic models with variation in complexity from naïve to machine learning algorithm and an ensemble. We made predictions in each of nine climate regions on a hexagonal grid and evaluated each model's predictive accuracy. Using the machine learning models (random forest and neural network), we identified the relative importance and variation in ranking of predictors (historical WNND cases, climate anomalies, human demographics, and land use) across regions. We found that historical WNND cases and population density were among the most important factors while anomalies in temperature and precipitation often had relatively low importance. While the relative performance of each model varied across climatic regions, the magnitude of difference between models was small. All models except the naïve model had non‐significant differences in performance relative to the baseline model (negative binomial model fit per hexagon). No model, including the ensemble or more complex machine learning models, outperformed models based on historical case counts on the hexagon or region level; these models are good forecasting benchmarks. Further work is needed to assess if predictive capacity can be improved beyond that of these historical baselines.


Introduction
The following text, figures, and tables provide additional, supporting details for the analysis and results described in the main text.
• Text S1 outlines additional details of model development and methods for converting bootstrap samples to probabilistic forecasts.• Development and details of model fit for the autoregressive climate model (AR(1)-Climate) are also outlined in Text S1 and presented in Figures S1-S2 and Table S1.• Figure S3 illustrates the mean logarithmic score of the Always Absent model in relation to the other ten models illustrated in Figure 2. • Differences in the estimate marginal means of models relative to the NB-hex model by region are presented in Figure S4.Differences for all pairwise comparisons for which the 95% highest posterior density (HPD) interval between did not contain zero are included in Table S2.Differences in the estimated marginal means of models across regions for which the 95% HPD interval between did not contain zero are presented in Table S3.
• Variable importance for machine learning models fit with normalized anomalies in monthly climatic data is presented in Figure S5.• Median importance of lagged variables in machine learning models fit with monthly climate anomalies is presented in Figure S6.• Figure S7 illustrates the median proportion of summed connection weights positive vs.
the median rank of variables from the seasonal and monthly neural network models.This provides a visual summary of the relative importance and direction of the relationship between covariates and number of WNND cases.• A comparison of variable importance from random forest models fit with either raw or normalized anomalies in seasonal climate data presented in Figure S8.Table S4 compares the mean logarithmic score per climate region for these two models.
• Table S5 contains the top six important variables from the four machine learning models (RF-season, RF-month, NN-season, and NN-month).

Example of iterative model fitting and prediction
All models were iteratively trained and tested for prospectively predicting hexagon-annual WNND cases in 2015-2021.Predictions for each year were made with new models trained on all data prior to the prediction year of interest such that the training dataset included an additional year of data for predicting each following year.In other words, each type of model was trained on data through 2014 to predict WNND cases in 2015.New models were then trained on data that now included 2015, and these were used to predict cases in 2016.We continued this iterative procedure for annual predictions through 2021.
All models produce probabilistic predictions which were used for model evaluation (see Section 2.3 for details on the logarithmic score).Depending on the type of model, these were obtained from the fitted distribution (NB-hex, NB-region, and NB-nation models), prescribed by underlying assumptions of no cases (Always Absent naïve model), or bootstrap samples [AR(1), AR(1)-Climate, random forest (RF), and neural network (NN)].These bootstrap samples incorporated both model (training residuals) and prediction uncertainty and were used to obtain probabilistic predictions (see next section for chosen method).See Section 2.2.1 for discussion of the negative binomial and Always Absent models.

Calculation of probabilistic forecast from bootstrapped samples
We considered two methods to produce probabilistic forecasts from modeling frameworks that predicted case counts (i.e., the AR(1), AR(1)-Climate, random forest, and neural network models).For the first method, we used the proportion of bootstrap samples equal to the number observed cases as the corresponding probability for calculating the logarithmic score.For the second method, we took 23 quantiles of the bootstrap samples (1%, 2.5%, 5%, 10%, 15%, …, 85%, 90%, 95%, 97.5%, 99%) and fitted these to a negative binomial distribution.Then, using the fitted distribution, we calculated the probability corresponding to the observed number of cases.We selected the negative binomial distribution which minimized the squared difference between the quantiles of the sample and distribution.For each method, bootstrapped samples were converted from the prediction scale (ln(cases+1)) to counts prior to calculating the proportions (Method 1) or quantiles (Method 2).Similar logarithmic scores were obtained from both methods.Method 1 tended to produce lower scores (worse) than Method 2 because it did not produce smooth distributions across predicted case counts when not all outcomes in a particular range occurred in the bootstrapped samples.For these reasons, we chose to used Method 2 as our calculation method throughout.

Model fitting for negative binomial models
For both the NB-hex and NB-nation models, we used Stan models to perform MCMC sampling to fit the respective negative binomial distributions (i.e., estimate the mean (µ) and dispersion (φ) parameters) using normal (mean of zero and standard deviation of one) priors for µ and 1 √ .For all models, we ran four chains with a burn-in of 1,500 samples, and then collected an additional 1,500, resulting in 6,000 samples across the four chains.All models were checked for convergence.We used the median posterior estimate for the parameter values of the fitted negative binomial models.

Development of autoregressive climate model
To select the AR(1)-Climate model for each region, we fitted a series of 24 AR(1) models, each with a single exogenous climate covariate.We considered seasonal aggregations of anomalies in mean temperature and total precipitation for each season of the prediction year and previous year (i.e., winter, spring, summer, and fall) where Dec-Feb represented "winter", Mar-May represented "spring", Jun-Aug represented "summer", and "Sep-Nov" represented "fall".Seasonal anomalies were a subset of those calculated for use as covariates in the machine learning models (see Section 2.1 for details).As with the AR(1) model, we predicted the distribution of cases by generating 1,000 bootstrapped samples (sum of sample from prediction and randomly selected training residual) and fit a negative binomial model to the quantiles of the bootstrapped samples.We compared predictive accuracy of each candidate model across all hexes and predicted years (2015-2021) using logarithmic scoring to select the single best model which we designated as the AR(1)-Climate model.All models were fitted and predicted using the Arima function in the forecast package (Hyndman et al., 2022;Hyndman & Khandakar, 2008).See Figure S1 for the performance of each candidate model and region and Table S1 for the selected seasonal climate variable for each region.Figure S2 illustrates the fit coefficients for the AR(1)-Climate models.
The skill of the candidate models was very similar across prediction years with variation in the relative ranking by mean logarithmic score each year.Of the climate covariates selected in each climate region, we identified broad geographic trends.Broadly, anomalies in winter precipitation were selected in three contiguous climate regions (Southwest, South, and Ohio Valley) while anomalies in temperature were selected in the other six regions.More specifically, the anomaly in mean temperature during the previous spring (Mar-May) was selected in both the Upper Midwest and Northern Rockies and Plains regions while the anomaly in total precipitation in the previous winter (Dec-Feb) was selected in both the Ohio Valley and South regions.No other regions had the same climate covariate selected.

Cross-validation of random forest models
For the nested cross-validation of random forest models, we tried withholding either oneor two-year "windows" of training data.We did not observe significant differences between model fits, predictive performance, and important variables.For example, both window lengths resulted in similar in-and out-of-sample RMSE and MAE with variation across regions which window length minimized these metrics.We found small differences in the resulting mean logarithmic scores (range of difference ≤0.03).

Choice of climatic data
We chose to use normalized anomalies vs. the raw value of temperature and precipitation in models to reduce the correlation among variables and produce unbiases interpretation of the importance metrics from machine learning models.While machine learning models can produce robust predictions using highly correlated variables (Obite et al., 2020;Strobl et al., 2008), inclusion of these variables reduce the interpretability of importance metrics from the fitted model (Nicodemus & Malley, 2009).Since information is shared between such correlated variables, there is a reduction in apparent importance of these variables.As one of our goals was to identify highly important factors for prediction, we chose to use normalized anomalies.
To compare how this decision may have impacted our conclusions, we fitted an additional random forest (RF) model.For this model, we used seasonal mean temperature and total precipitation instead of normalized anomalies.All other variables were the same as the RFseason model (see Section 2.2.2 for details).We chose the RF-season model because it was identified as the "best" of the machine learning models overall (see Figure 3).
We did find some variation in the magnitude of variable importance between the RFseason model using climate anomalies vs. raw climate values (Figure S8).Across regions, variables for precipitation often had higher importance in the model with raw climate values than with normalized anomalies.In contrast, lagged WNND cases tended to have higher importance when climate anomalies were included.Overall, we estimated very similar mean logarithmic scores (Table S4), indicating similar predictive accuracy of the two model versions.S1 and Figure S2 for further details).Order of panels represent the relative geographic location of the nine climate regions (see Figure 1a).Seasons were defined as three-month aggregations of normalized anomalies in winter (Dec-Feb), spring (Mar-May), summer (Jun-Aug), and fall (Sep-Nov).Tmin: minimum temperature, Tmean: mean temperature, Precip: total precipitation, anom: normalized anomaly  1a) and subheadings indicate the normalized anomaly in the seasonal climate variable used per region (i.e., "best" covariate; see Figure S1).Seasons were defined as three-month aggregations for winter (Dec-Feb), spring (Mar-May), summer (Jun-Aug), and fall (Sep-Nov).Tmin: minimum temperature, Tmean: mean temperature, Precip: total precipitation.S5 for further details on variable importance.Order of panels represent the relative geographic location of the nine climate regions in the US (see Figure 1a).S5 for further details on variable importance.Order of panels represent the relative geographic location of the nine climate regions in the US (see Figure 1a).S5 for further details on variable importance.Order of panels represent the relative geographic location of the nine climate regions in the US (see Figure 1a).

Figure S8
. Importance of variables from random forests fitted with raw or normalized anomalies of seasonal climate data.Importance measured as median increase in mean squared error (MSE) when the variable was permuted.Diagonal grey line indicates 1:1 correspondence between importance of fitted models.Order of panels represent the relative geographic location of the nine climate regions in the US (see Figure 1a).0.18 † Seasons represented by three-month aggregations: winter (Dec-Feb), spring (Mar-May), summer (Jun-Aug), and fall (Sep-Nov).‡ Variable rank by median order of importance across all models.In each model, variables ranked by median relative importance (RF: increase in mean squared error; NN: relative magnitude of summed connection weights).In a few cases, median relative importance is not decreasing with decreasing rank.# Median increase in mean squared error (MSE) of prediction when variable permuted.+ Median relative magnitude (0-1) of summed connection weights.Precip: total precipitation; Temp: temperature; Cases: WNND cases

Figure S1 .
Figure S1.Mean logarithmic score of AR(1) and AR(1)-Climate candidate models per climate region.a) Mean logarithmic score per year (2015-2021) of AR(1) and AR(1)-Climate candidate models.Smaller scores (closer to 0) indicate better scores.b) Ranking of models based on mean logarithmic score by prediction year (descending rank).Text inset indicates climate covariate in the candidate model with the lowest mean logarithmic score overall (i.e., selected AR(1)-Climate model per region (bold line), see TableS1and FigureS2for further details).Order of panels represent the relative geographic location of the nine climate regions (see Figure1a).Seasons were defined as three-month aggregations of normalized anomalies in winter (Dec-Feb), spring (Mar-May), summer (Jun-Aug), and fall (Sep-Nov).Tmin: minimum temperature, Tmean: mean temperature, Precip: total precipitation, anom: normalized anomaly

Figure S3 .
Figure S3.Mean yearly logarithmic score of models for WNND prediction.Scores closer to zero indicate better performance.Order of panels represent the relative geographic location of the nine climate regions in the US (see Figure1a).

Figure S4 .
Figure S4.Median difference in estimated marginal means of predicted surprisal for models relative to the NB-hex model by climate region.Range indicates 95% highest probability density (HPD) of the difference.Comparison with Always Absent model omitted for scale (see TableS2).Increasing values of surprisal indicate worse forecast skill.
Figure S4.Median difference in estimated marginal means of predicted surprisal for models relative to the NB-hex model by climate region.Range indicates 95% highest probability density (HPD) of the difference.Comparison with Always Absent model omitted for scale (see TableS2).Increasing values of surprisal indicate worse forecast skill.

Figure S5 .
Figure S5.Variable importance for machine learning models fit with normalized anomalies in monthly climatic data.Random forest importance (y-axes) measured as median percent increase in mean squared prediction error (MSE) when variable permuted.Neural network importance (xaxes) measured as median relative magnitude of summed connection weights.See FigureS6and TableS5for further details on variable importance.Order of panels represent the relative geographic location of the nine climate regions in the US (see Figure1a).

Figure S6 .
Figure S6.Median importance of lagged variables in models fit with monthly climate data (normalized anomalies).Importance based on a) median increase in prediction error (mean squared error, MSE) when variable permuted in random forest or b) median relative magnitude of summed connection weights in neural network.Importance measured across all models and prediction years.Lags of zero-years indicate values from prediction year.See TableS5for further details on variable importance.Order of panels represent the relative geographic location of the nine climate regions in the US (see Figure1a).

Figure S7 .
Figure S7.Median proportion of summed connection weights positive from neural networks.Neural networks fit with a) seasonal or b) monthly climate data.Variables ranked by median relative magnitude of summed connection weights (see Figure 4).Points above horizontal line indicate most summed connection weights were positive while points below the line indicate most summed connection weights were negative.See TableS5for further details on variable importance.Order of panels represent the relative geographic location of the nine climate regions in the US (see Figure1a).

Table S1 .
Climate variable selected for AR(1)-Climate model per climate region.Normalized anomalies calculated by season using three-month aggregations.Winter: Dec-Feb; Spring: Mar-May; Summer: Jun-Aug; Fall: Sep-Nov.

Table S2 .
Difference in estimated marginal means for models in each climate region.Only contrasts between models where the 95% highest posterior density (HPD) interval did not contain zero are included.Negative differences indicate worse relative forecast skill of the second model in the contrast.

Table S3 .
Difference in estimated marginal means for models across all nine climate regions.Only contrasts between models where the 95% highest posterior density (HPD) interval did not contain zero are included.Negative differences indicate worse relative forecast skill of the second model in the contrast.

Table S4 .
Mean logarithmic score for WNND prediction from random forest models fitted with either raw or normalized anomalies in seasonal climate data.Scores closer to 0 indicate better performance.

Table S5 .
Top six variables per climate region.Random forest (RF) and neural network (NN) models trained with either lagged seasonal † or monthly climate data (normalized anomalies, lagged 0-5 years relative to the prediction year).